DOCUMENT RESUME

ED 228 273                                        TM 830 122

AUTHOR          Herman, Joan L.
TITLE           Criteria for Reviewing District Competency Tests.
INSTITUTION     California Univ., Los Angeles. Center for the Study
                of Evaluation.
SPONS AGENCY    National Inst. of Education (ED), Washington, DC.
REPORT NO       CSE-RP-4
PUB DATE        82
NOTE            30p.
PUB TYPE        Reports - Evaluative/Feasibility (142)
EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     *Criterion Referenced Tests; Cutting Scores;
                *Evaluation Criteria; Formative Evaluation; *Minimum
                Competency Testing; Models; Performance Factors;
                *School Districts; Test Bias; *Testing Programs; Test
                Reliability; *Test Reviews; Test Validity
IDENTIFIERS     Standard Setting

ABSTRACT

A formative evaluation minimum competency test model
is examined. The model systematically uses assessment information to
support and facilitate program improvement. In terms of the model,
four inter-related qualities are essential for a sound testing
program. The content validity perspective looks at how well the
district has defined competency goals and the extent to which
tests reflect those definitions: the match between district
objectives and test items. The technical quality perspective examines
the adequacy of the test as a sound measurement instrument: the
goodness of the test itself. Standard setting procedures look at how
the district has defined acceptable performance: the standard for
determining remedial needs. Finally, curricular validity looks at how
well the district's instructional program reflects its objectives and
assessment efforts: the match between tests and instruction. (PN)



***********************************************************************
*    Reproductions supplied by EDRS are the best that can be made     *
*                     from the original document.                     *
***********************************************************************



U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.

Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this document
do not necessarily represent official NIE
position or policy.



CRITERIA FOR REVIEWING
DISTRICT COMPETENCY TESTS



Joan L. Herman



"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY



TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC)."



CSE Resource Paper No. 4 






1982 



Center for the Study of Evaluation 

Graduate School of Education
University of California, Los Angeles



The project presented or reported herein was performed
pursuant to a grant from the National Institute of
Education, Department of Education. However, the
opinions expressed herein do not necessarily reflect
the position or policy of the National Institute of
Education, and no official endorsement by the National
Institute of Education should be inferred.




TABLE OF CONTENTS

INTRODUCTION

CONTENT VALIDITY
    Do the Skills Tested Constitute Competence?
    Are the Assessed Skills Adequately Described?
    Do the Test Items Match Their Descriptions?
    Does the Item Sampling Scheme Proportionately Represent
        the Domain of Interest?

TECHNICAL QUALITY
    Does the Test Provide Consistent Estimates of Student
        Performance?
    Is the Test Sensitive to Learning and Competency?
    Do the Tests Measure Coherent Skills?
    Are the Tests Free from Bias?

STANDARD SETTING PROCEDURES

CURRICULAR VALIDITY

REFERENCES

APPENDIX




INTRODUCTION

This paper reviews four inter-related qualities that are
essential for a sound competency testing program:

1. Content validity: Do the tests measure meaningful and
   significant competencies? Do they clearly describe students'
   status with respect to those competencies?

2. Technical quality: Are the test items technically sound,
   reliable, sensitive to instruction, and free from bias?

3. Standard setting procedures: Were reasonable procedures
   used to establish minimum performance criteria? Are the
   cut-off scores defensible?

4. Curricular validity: What is the relationship between the
   competency tests, district curricula, and classroom
   instruction? To what extent are the test competencies reflected
   in the instructional program?

These perspectives derive from commonly advanced principles of
criterion-referenced test construction, in general, and of competency
testing in particular. (See, for example, Berk, 1980; Hambleton,
1979; CSE, 1979.) They also reflect a particular view of what
purposes competency tests ought to serve and the nature of an optimal
assessment system. We make these views explicit before moving to the
test reviews.

We assume that competency tests ought to assess students'
proficiency with regard to clearly specified district goals and that the
results of such tests ought to be used to improve the quality of
instruction for students and to facilitate student achievement. Test
results can identify individual student needs, on the one hand, and, in
the aggregate across students, can be used to identify areas where
school or district programming requires strengthening. Instructional
efforts can then be targeted to areas of need, through tailored
remedial efforts and through more future-oriented curriculum analysis
and improvement.






The idea is not that tests ought to drive district curriculum and
instruction, nor that teachers, strictly speaking, ought to "teach to
the test." Rather, both testing and instruction ought to reflect
significant, agreed-upon district competency goals. Tests should measure
important objectives, and classroom or other school instruction should
provide students an opportunity to attain those objectives, a view
upheld by recent court discussion in minimum competency test litigation.
(See Figure 1.)

Figure 1
Minimum Competency Test Model

[Diagram not reproduced. Labeled boxes: District Competency Goals and
Objectives; District Competency Tests; District Instructional Program;
Test Results (District, School, Individual); Instructional Implications
(Areas for District/School Instructional Effort, Individual Remediation
Needs).]






Figure 1 displays a formative evaluation model that systematically
uses assessment information to support and facilitate program
improvement. The test reviews address the adequacy of the district's
efforts for implementing such a model as well as the integrity of the
tests for assessing competency. In terms of the model, the content
validity perspective looks at how well the district has defined
competency goals and the extent to which the tests reflect those
definitions: the match between district objectives and test items. The
technical quality perspective examines the adequacy of the test as a
sound measurement instrument: the goodness of the test itself.
Standard setting procedures look at how the district has defined
acceptable performance: the standard for determining remedial needs.
Finally, curricular validity looks at how well the district's
instructional program reflects its objectives and assessment efforts:
the match between tests and instruction. These perspectives, of course,
are quite inter-dependent; for example, content validity, or any other
kind of validity, is impossible without adequate technical quality.

Each of the review criteria is more fully defined in the sections
which follow. We make explicit those areas where there is little
agreement among experts and where there are problems in available
methodologies; here, we offer our views of the best available
solutions.

CONTENT VALIDITY

Scores on competency tests are not ends in themselves; they are
of interest because they indicate students' proficiency with regard
to particular skills and/or content areas, those deemed necessary for
competence. Content validity asks whether a test score is meaningful
in this sense: whether the test measures what it claims to measure.
To the extent that student scores are representative of competence,
the tests are content valid. Such validity is not established
statistically, but rather is demonstrated by means of well documented,
rational, systematic, and logical judgments. The answers to several
questions are germane here:



1. To what extent do the assessed skills constitute competence?
   For example, do the objectives measured on the reading test
   fairly represent those needed for an individual to be
   competent?

2. Are the assessed skills adequately described? We can make
   judgments on how well a test measures particular skills only
   insofar as we are clear on what the test is intended to
   measure. A clear description enables item writers to write good
   items, teachers to teach the skills that are described, and
   tells consumers of test results, including parents and
   students, just what was tested; the test description thus gives
   meaning to the test score.

3. To what extent do the test items match their descriptions?
   The test descriptions above represent intentions in
   constructing tests. Here we ask, "Were the intentions carried
   out?" If the test items are adequately described by the test
   description, then we have a good idea of what is tested; if
   not, then the items measure something else, and the test is
   invalid.

4. Does the item sampling scheme fairly represent the domain of
   interest? Does the number of items included for each skill
   reflect its importance relative to the total set of tested
   skills? Number 1 above asks whether the skills tested are a
   reasonable definition in a particular area, e.g., reading.
   Here we are interested in whether the total pool of items
   represents skills in proportion to their importance for
   competence.

We take each of these content validity issues in turn, describe
procedures for optimizing content validity, and suggest actions the
district may want to take.

Do the Skills Tested Constitute Competence?

There certainly is no sure and fast definition of competence.
The issue is fraught with philosophical and value judgments, beset by
methodological problems barring empirical validation and, not
surprisingly, subject to wide interpretation. Despite these problems,
however, a commonly accepted approach to defining competencies has
evolved. This approach features the consensus of community and school
personnel and subject area specialists. Typically, a committee is
charged with the responsibility for generating a list of skills to be
assessed based on input from teachers, administrators, parents,
students, other interested citizenry, and extant curricula and
textbooks. The most important skill areas are identified, again based
on the input of all interested parties. Lastly, the final selection of
test content is validated by surveying the opinions of teachers,
parents, administrators, students, subject area scholars, and
community members about the comprehensiveness, representativeness, and
relevance of the selected test content.

Are the Assessed Skills Adequately Described?

Clear descriptions of tested skills are essential for competency
testing and accountability. Optimally, content, format, response
mode, and sampling for each skill are described thoroughly enough so
that

on the testing side:

a. different test writers should produce equivalent test items
   by following the instructions inherent in the description;

b. it is clear whether any test item, or set of items, falls
   within or outside the test domain (CSE, 1979);

on the instruction side:

c. teachers can provide instruction and equivalent practice for
   the skill domain;

d. it is clear whether or not instructional activities directly
   address the domains of interest;

and on the fairness side:

e. the test expectations are clear to all.

Several approaches to test descriptions have been advanced.
Among them are item forms, amplified objectives, domain
specifications, and mapping sentences. Domain specifications
(following the work of Popham, 1971, 1980; and Baker, 1974, among
others) seem to provide an optimal compromise between practicality and
technical sophistication.

Following this approach, domain specifications are created for
each objective or skill tested. The domain specification includes:

1. A general description of the skill to be assessed, e.g., the
   instructional objective.

2. A sample item, including directions for administration.

3. Content limits, i.e., a description of the content presented
   to the students, and the critical features of the task to
   which students respond, including, e.g., eligible
   information, concepts and/or principles, structural
   constraints (length, organization), and language constraints
   (semantic and linguistic complexity).

4. Response limits, i.e., the nature of the required response.
   For selected response items, rules for generating correct and
   incorrect alternatives are given; for constructed response
   items, criteria are included for judging the adequacy of
   students' responses.

Several sample domain specifications are provided in the appendix
as illustrations.
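
The four parts of a domain specification listed above can be held in a small record so that incomplete specifications are easy to spot. The following is only a sketch: the class name, field names, and the example content are hypothetical and are not taken from the appendix.

```python
from dataclasses import dataclass, field


@dataclass
class DomainSpecification:
    """One domain specification, mirroring the four parts listed above."""
    general_description: str   # the skill to be assessed
    sample_item: str           # includes directions for administration
    content_limits: list[str] = field(default_factory=list)
    response_limits: list[str] = field(default_factory=list)

    def missing_parts(self) -> list[str]:
        """Return the names of any of the four parts left empty."""
        parts = {
            "general_description": self.general_description,
            "sample_item": self.sample_item,
            "content_limits": self.content_limits,
            "response_limits": self.response_limits,
        }
        return [name for name, value in parts.items() if not value]


# Hypothetical example content, for illustration only.
spec = DomainSpecification(
    general_description="Identify the main idea of a short passage.",
    sample_item="Directions: Read the paragraph, then mark the best answer.",
    content_limits=["Passage length: 100-150 words"],
)
print(spec.missing_parts())
```

Here the response limits were left out on purpose, so the check reports that part as missing.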

Do the Test Items Match Their Descriptions?

As noted above, the test descriptions portray a test maker's
intentions. It is still necessary to verify that the intentions were
carried out, and that the test items really measure the skills they
purport to measure. While evidence that the test items were generated
from the detailed test specifications is one step toward such
verification, independent confirmation from qualified judges is also
highly desirable. This confirmation process might also examine
whether there are extraneous factors in the test items that detract
from measurement validity (e.g., linguistic complexity, cultural bias,
vocabulary level, unclear directions, confusing format) and that
may confound a student's ability to demonstrate a particular skill.

Does the Item Sampling Scheme Proportionately Represent the Domain of
Interest?

This last aspect of content validity is a simple one, but it is
frequently overlooked. When judgments about a student's competency
are made on the basis of a single test score, decisions about the
number of items to include per skill objective should be based on the
relative importance of each objective; alternatively, student scores
should be weighted accordingly. For example, if you were to construct
a twenty-item test measuring four objectives of equal importance, you
would want to include five items for each objective. Alternatively,
if two of the objectives were judged twice as important as the
remaining ones, you might want to allocate your twenty items
differently, e.g., seven for each of the more important objectives and
three each for the remainder. Parenthetically, it should also be noted
that the reliability of an individual's score on a particular skill is
a direct function of the number of items in the skill objective. While
no absolute minimum number of items can be specified without knowing
the difficulty and variation of test item performance, it has been
recommended that reasonably accurate estimates of individual abilities
are obtained with a dozen or so items per skill objective.
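
The twenty-item example above can be worked through mechanically. The sketch below assumes integer importance weights and uses a largest-remainder rounding rule; both the function name and the rounding rule are illustrative choices, not something the paper prescribes.

```python
def allocate_items(total_items: int, weights: list[int]) -> list[int]:
    """Split total_items across objectives in proportion to integer weights.

    Uses the largest-remainder rule so the counts always sum to total_items.
    """
    if total_items < 0 or not weights or any(w <= 0 for w in weights):
        raise ValueError("need a non-negative total and positive weights")
    weight_sum = sum(weights)
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = [total_items * w // weight_sum for w in weights]
    remainders = [total_items * w % weight_sum for w in weights]
    leftover = total_items - sum(counts)
    # Hand the leftover items to the largest remainders (ties: earlier first).
    order = sorted(range(len(weights)), key=lambda i: -remainders[i])
    for i in order[:leftover]:
        counts[i] += 1
    return counts


print(allocate_items(20, [1, 1, 1, 1]))  # four equally important objectives
print(allocate_items(20, [2, 2, 1, 1]))  # two objectives weighted double
```

With equal weights this yields five items per objective; with two objectives weighted double it yields seven, seven, three, and three, matching the allocation described in the text.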






TECHNICAL QUALITY

Authorities in the field agree that content validity and
descriptive rigor are essential for a good minimum competency test.
However, there is less agreement on the need for empirical (data based)
validation of tests. CSE maintains the position that both types of
validity are interdependent and that both are necessary to assure test
integrity. Without empirical validation, a test that appears to be
conceptually sound may give measures that are not consistent, that are
insensitive to students' competence levels, that are biased, and/or
that measure unintended skills and/or abilities. Indices of technical
quality help prevent such occurrences by signalling potential problems
in a number of areas:

1. Does the test provide consistent estimates of students'
   performance?

2. Is the test sensitive to school learning? Do the test items
   differentiate between students who are competent and those
   who are not?

3. Does the test measure a coherent skill?

4. Are the tests free from bias, or do they seem to discriminate
   against particular subgroups?



Test statistics alone can neither discredit nor guarantee test
validity, but they are useful for identifying items or subscales that
need further scrutiny.

Does the Test Provide Consistent Estimates of Student Performance?

A test is consistent if the difference in a student's score on
two occasions is due to a real change in achievement. If a student's
score changes as a result of poor directions, variations in testing
conditions, or other irrelevant factors, then the test's scores are
not consistent, and the test scores do not reflect real learning or
achievement.




Test-retest reliability and alternate forms reliability are two
indices of test consistency that are particularly important for
minimum competency tests. Test-retest reliability indicates whether, in
the absence of new learning, test scores are consistent over time.
That is, if a test is given on two occasions, and no relevant
instruction or learning occurs in the interim, then students' skill
levels and their test scores should be the same. If, under these
conditions, test scores vary substantially, then the test does not
provide a good estimate of student skill proficiency.

Alternate forms reliability indicates the extent to which two or
more forms of a test give parallel information. If both provide
similar estimates of students' performance, this constitutes evidence
that both tests are consistent and measure the same skill: the skill
they are supposed to measure.

Test-retest reliability and alternate forms reliability are
indices that developed out of classical, norm-referenced test
methodologies. In the context of minimum competency testing and
pass-fail decisions, consistency needs to be demonstrated for pass-fail
judgments: i.e., are pass or fail judgments consistent from one test
occasion to the next, and/or do two supposedly equivalent test forms
yield consistent pass or fail decisions? Although several methods have
been advanced for calculating these two reliability indices, proportion
of agreement seems the simplest and most straightforward computation.

A district may wish to administer its tests on two occasions,
e.g., at an interval of two weeks or so, and then compute the
proportion of agreement in pass-fail judgments, i.e., the proportion of
students who were similarly classified on both testing occasions
(either pass-pass or fail-fail). Similarly, it would be useful to
administer the "pre" (fall) and "post" (spring) versions of the test
simultaneously to a sample of students and determine the extent to
which the two forms yield consistent pass or fail decisions. The
district may also wish to examine the consistency of pass-fail
judgments for the specific competencies measured by the test,
particularly if the tests are intended to function as diagnostic tools.
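
The proportion of agreement described above reduces to a short count. This sketch assumes that a student passes when the score is at or above the passing score; the function name and the sample scores are hypothetical.

```python
def proportion_of_agreement(first: list[float], second: list[float],
                            passing_score: float) -> float:
    """Share of students classified the same way (pass-pass or fail-fail)."""
    if len(first) != len(second) or not first:
        raise ValueError("need two equal-length, non-empty score lists")
    same = sum(
        (a >= passing_score) == (b >= passing_score)
        for a, b in zip(first, second)
    )
    return same / len(first)


# Hypothetical scores for five students on two occasions; passing score 70.
occasion_1 = [82, 65, 71, 50, 90]
occasion_2 = [85, 72, 68, 55, 88]
print(proportion_of_agreement(occasion_1, occasion_2, 70))
```

The same function applies to alternate forms: pass the scores from the two forms instead of the scores from the two occasions.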

While the reported measurement quality indices presumably give
some indication of alternate forms reliability, a district also may
wish to investigate test-retest reliability for the high school
proficiency tests.

Is the Test Sensitive to Learning and Competency?

Students' scores on a test may or may not reflect actual student
learning and may or may not accurately portray their competency with
respect to skills the test is intended to measure. A test which does
provide such an accurate portrayal is described as sensitive to the
phenomenon of interest (sensitive to school-learned basic competency),
and scores from such tests provide a reasonable basis for
competency or non-competency judgments. Naturally, evidence of such
sensitivity is important for establishing test validity.

Two aspects of sensitivity are implied above. First, are test
scores sensitive to what students learn in school and do they reflect
the positive effects of instruction? For example, do students who
have been instructed in the assessed skills outperform those who have
not been so instructed? While quality of instruction may affect the
answer, it is important to demonstrate that test content is teachable
and that test scores indeed reflect school learning. Otherwise, the
utility of test scores for school or individual accountability is
negligible.

A second aspect of sensitivity focuses on test accuracy in
differentiating between masters and non-masters, or between those who
are "competent" and those who need additional remediation. For example,
do those who are independently judged competent pass the test items
and those who are not so judged fail the items?

Similar item analyses can be conducted to investigate both
aspects, although there is not yet consensus on how to prove a test's
sensitivity. While acknowledging that all available techniques suffer
from some technical problems, several easy to compute alternatives are
suggested below. These methods identify items that appear to be
insensitive and that therefore need additional review. This
additional review may uncover problems, e.g., ambiguous wording, poor
distractors, unfamiliar vocabulary, poor construction. If such
defects are discovered, items should be revised or discarded.
Alternatively, closer inspection may not reveal any defect, in which
case no revision is necessary. In other words, an item should not be
rejected solely on the basis of an item statistic, but only when both
the empirical analyses and substantive review indicate a problem.

To determine whether the tests are sensitive to school learning,
the district may wish to administer the same test at several grade
levels, and examine the extent to which students' scores improve with
instructional exposure. For example, one would expect students'
achievement to increase over time, especially from prior to post
instructional exposure. Pre-to-post instruction growth is evidence
that a test item is sensitive to school learning. Most simply, the
question is, do instructed students in high grades outperform those in
lower grades or do students' test scores increase from the beginning
to the end of the school year?

A parallel question addresses the issue of sensitivity to minimum
competency: do students who are clearly competent outperform students
who are clearly not competent? There are problems in independently
identifying students who fall into each category, and many schemes
have been used, e.g., teacher judgments, school grades, other test
scores. However, demonstrating that test items and passing scores
differentiate between the competent and the non-competent is an
important validity issue.

Sensitivity indices indicate the extent to which test items
differentiate between criterion groups: between instructed and
uninstructed students (either at different grade levels, or students at
the beginning of the school year versus the same students at the end
of the year) and/or between masters (competents) and non-masters (not
competents).

Several easy to calculate statistics are based on item difficulty,
the proportion of students who answer an item correctly:

   Item difficulties are computed separately for the two criterion
   groups, and then compared to see whether there are differences in
   the expected direction, e.g., the proportion of students who
   answered an item correctly on the posttest minus the proportion
   who answered it correctly on the pretest. One would expect
   considerably higher item difficulty values for the instructed or
   for the "competent" groups.
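
The item-difficulty comparison just described can be sketched in a few lines. The data layout (one row of 0/1 item scores per student), the function names, and the sample data are assumptions made for illustration.

```python
def item_difficulty(responses: list[list[int]]) -> list[float]:
    """Proportion answering each item correctly.

    responses[s][i] is 1 if student s answered item i correctly, else 0.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_students
            for i in range(n_items)]


def sensitivity_differences(uninstructed: list[list[int]],
                            instructed: list[list[int]]) -> list[float]:
    """Item difficulty for the instructed group minus the uninstructed group."""
    before = item_difficulty(uninstructed)
    after = item_difficulty(instructed)
    return [a - b for a, b in zip(after, before)]


# Hypothetical data: 4 students, 3 items, pretest then posttest.
pretest = [[0, 1, 0], [0, 1, 1], [1, 1, 0], [0, 0, 0]]
posttest = [[1, 1, 0], [1, 1, 0], [1, 1, 1], [1, 1, 0]]
diffs = sensitivity_differences(pretest, posttest)
print(diffs)
```

In this made-up data the third item shows no gain from pretest to posttest, so it would be a candidate for the substantive review described earlier rather than for automatic rejection.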

Other, more sophisticated indices also have been developed, but
these cannot be calculated efficiently without a computer program.

Do the Tests Measure Coherent Skills?

Items on the competency tests are developed to assess specific
competencies, and ideally there should be evidence that each of these
competencies represents a coherent skill. Some believe that such
coherence is demonstrated by measures of the extent to which all items
for a given skill function alike. For example, the more students'
scores on one test item are similar to their scores on the other items
measuring the same skill, the greater the coherence of the measure.
Such information can supplement and help confirm content validity
judgments.

In practice, however, item coherence (or homogeneity) is often
unrealistic, because a variety of skills may define the content of an
instructional objective or competency. For example, a phonics
competency might deal with a variety of categories of consonants (e.g.,
stops, liquids, nasals). While student performance might be uniform
within each category, it would not necessarily be consistent across
categories.

A district may wish to examine item homogeneity and consistency
within subscales to signal possible aberrant items and/or to help
verify item-test description match. Factor analyses of competencies
including at least ten items and appraisal of item difficulties within
subscales would also be useful so long as a full range of abilities
exists in the data being analyzed.
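
One simple way to look for aberrant items within a subscale is to correlate each item with the total of the remaining items. The paper does not name a specific index, so the item-rest correlation below is an assumed stand-in, and the function names and response data are hypothetical.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either list has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(responses: list[list[int]]) -> list[float]:
    """Correlate each item with the total of the other items in its subscale."""
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        result.append(pearson(item, rest))
    return result


# Hypothetical 0/1 responses: 6 students, 4 items in one subscale.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]
correlations = item_rest_correlations(responses)
print([round(r, 2) for r in correlations])
```

In this made-up data the fourth item runs against the other three, so its correlation is negative. As the text cautions, a low value signals an item for review; it does not by itself show that the item is defective.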

Are the Tests Free from Bias? 



A test is biased for a given group of students (e.g., a particular
ethnic or language group) if it does not permit them to demonstrate
their skills as completely as it permits other groups to do so, and/or
if it taps different skills and/or abilities in different groups. For
obvious reasons, bias has been a controversial issue in achievement
testing. It is particularly significant in minimum competency testing
because of the potential consequences of such tests.

Bias can be apparent in a test in a number of ways, including
obvious presentation defects (e.g., items that disparage some groups,
that depict solely majority customs or activities that are
stereotypic, etc.), linguistic/semantic problems, and socio-cultural
and contextual bias. A careful item review can minimize the more obvious
problems, but such analyses should be supplemented with statistical
procedures for detecting bias.

These statistical analyses are derived conceptually from the
nature of an unbiased test: one that measures the same skill or
ability and is equally reliable and sensitive for all groups. Evidence
that the patterns of performance are similar for all groups is one way
to document that a test is not biased. Demonstrating that technical
quality indices (e.g., consistency, coherence, sensitivity) are
similar for all groups is additional evidence that a test is
relatively free of bias.



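
As one illustration of comparing patterns of performance across groups, the sketch below computes item difficulties separately for two groups and flags items whose between-group gap departs from the average gap. The paper does not specify a bias statistic; this screening rule, the tolerance value, and the sample data are all assumptions, and a flagged item calls for review rather than proving bias.

```python
def item_difficulty(responses: list[list[int]]) -> list[float]:
    """Proportion of students answering each item correctly (rows = students)."""
    n_students = len(responses)
    return [sum(row[i] for row in responses) / n_students
            for i in range(len(responses[0]))]


def flag_uneven_items(group_a: list[list[int]], group_b: list[list[int]],
                      tolerance: float = 0.2) -> list[int]:
    """Indices of items whose group gap differs from the mean gap by more
    than `tolerance` (an arbitrary, illustrative threshold)."""
    p_a = item_difficulty(group_a)
    p_b = item_difficulty(group_b)
    gaps = [a - b for a, b in zip(p_a, p_b)]
    mean_gap = sum(gaps) / len(gaps)
    return [i for i, gap in enumerate(gaps) if abs(gap - mean_gap) > tolerance]


# Hypothetical 0/1 responses for two groups on a three-item test.
group_a = [[1, 1, 1], [1, 1, 1], [1, 1, 1], [0, 0, 0]]
group_b = [[1, 1, 1], [1, 1, 0], [1, 1, 0], [0, 0, 0]]
print(flag_uneven_items(group_a, group_b))
```

Here the two groups perform alike on the first two items but not on the third, so only the third item (index 2) is flagged for closer inspection.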



STANDARD SETTING PROCEDURES 
There is no simple answer to the problem of setting reasonable
standards for competency tests. A variety of methods for setting
passing scores have been advanced, but all have been criticized as at
least somewhat arbitrary, because all require human judgment. But
imperfection does not obviate the need for decisions, and more reasoned
judgments tend to produce more reasoned and defensible decisions.

Most recent approaches to setting passing scores acknowledge the
need for multiple sources of information, and combine judgmental and
empirical data. Many advocate input in the judgment process from a
broad cross-section of constituents. Several methods are described
below to illustrate the range of available alternatives.

Several principally judgmental methods require judges to examine
each item on a test and decide whether or not a minimally competent
student should be able to answer the item correctly, or some variation
of such an individual item rating. Passing scores are then computed
by averaging over judges the total number of correct responses that a
minimally competent student should be able to provide. Most recent
variants of this approach require the use of pilot-test data to help
assure that judges' ratings are realistic. For example, judges are
provided with item analyses from a district pilot test to help them
ascertain the difficulty of the item, and whether or not a minimally
competent person should be able to correctly answer the item. An
iterative process of rater judgments, resultant passing standards, and
the normative implications of those passing standards (e.g., the
percentage of high school students likely to fail) is then used to
arrive at a final decision. (See, for example, Jaeger, 1978.)
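
The basic computation behind the item-rating methods above, before any iteration with pilot-test data, can be sketched as follows. The rating layout (one row of 0/1 judgments per judge), the function name, and the sample ratings are assumptions for illustration.

```python
def judged_passing_score(ratings: list[list[int]]) -> float:
    """Average, over judges, of the number of items each judge expects a
    minimally competent student to answer correctly.

    ratings[j][i] is 1 if judge j expects a correct answer on item i, else 0.
    """
    if not ratings:
        raise ValueError("need at least one judge")
    totals = [sum(judge) for judge in ratings]
    return sum(totals) / len(totals)


# Hypothetical ratings from three judges on a five-item test.
ratings = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0],
]
print(judged_passing_score(ratings))
```

The three judges expect 3, 4, and 2 correct answers respectively, so the initial passing score is their average. In the iterative variants described above, this figure would then be reviewed against its normative implications.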

Another approach to setting standards asks raters to make
judgments about mastery levels of students rather than about test items.
Judges (most likely teachers) identify students as "competent,"
"incompetent," or "borderline" with respect to the subject domain
being tested. In the "borderline group" method, the students so
identified are administered the test, and the median test score for
this group becomes the standard. Alternatively, in the contrasting
groups method, the test is administered to students who are identified
as clearly competent and to those who are identified as clearly
incompetent. Score distributions are plotted for the two groups, and
the point at which the two distributions intersect becomes the first
estimate of the standard. This estimate can then be adjusted up or down
to minimize different types of decision errors, i.e., misclassifying a
competent student as incompetent and vice versa.
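
Both student-based methods can be sketched briefly. The borderline-group standard is the median described in the text. For contrasting groups, the sketch does not plot or intersect distributions; as a stated substitute, it picks the observed score that minimizes the total number of misclassified students, which serves the same purpose of a first estimate. Function names and scores are hypothetical.

```python
from statistics import median


def borderline_group_standard(borderline_scores: list[float]) -> float:
    """Median test score of the students judged borderline."""
    return median(borderline_scores)


def contrasting_groups_standard(competent: list[float],
                                incompetent: list[float]) -> float:
    """Observed score that minimizes total misclassifications when a student
    passes at or above the cut. Ties go to the lowest such score."""
    candidates = sorted(set(competent) | set(incompetent))
    best_cut, best_errors = None, None
    for cut in candidates:
        false_fail = sum(s < cut for s in competent)     # competent, but fails
        false_pass = sum(s >= cut for s in incompetent)  # incompetent, but passes
        errors = false_fail + false_pass
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


# Hypothetical scores.
borderline = [11, 12, 13, 13, 15]
competent = [14, 15, 16, 17, 18, 12]
incompetent = [8, 9, 10, 11, 13, 7]
print(borderline_group_standard(borderline))
print(contrasting_groups_standard(competent, incompetent))
```

Weighting the two error counts differently (for example, counting a false fail as more costly than a false pass) is one way to model the adjustment up or down that the text mentions.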

This latter consideration is an important one, regardless of the
method used. Students' test scores, at best, provide only estimates
of their competence. Indices such as the standard error of
measurement provide some indication of the quality of the estimate,
and/or the potential error incorporated into the test scores. Passing
scores should not be set without some consideration of measurement
errors and likely classification errors.
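
As a rough illustration of taking measurement error into account, the sketch below uses the classical test theory formula for the standard error of measurement, SD times the square root of one minus the reliability. That formula is standard but is not stated in this paper, and the function names, the band width, and the sample figures are assumptions.

```python
from math import sqrt


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical formula: SEM = SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * sqrt(1 - reliability)


def uncertain_band(passing_score: float, sem: float,
                   width: float = 1.0) -> tuple[float, float]:
    """Score range around the passing score, `width` SEMs on each side, where
    pass-fail classifications are least certain."""
    return (passing_score - width * sem, passing_score + width * sem)


# Hypothetical test: score standard deviation 10, reliability 0.84, cut 70.
sem = standard_error_of_measurement(10, 0.84)
low, high = uncertain_band(70, sem)
print(round(sem, 2), round(low, 2), round(high, 2))
```

Students scoring inside such a band are the ones most likely to be misclassified, which is why the passing score and the handling of near-miss scores deserve joint consideration.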

CURRICULAR VALIDITY 
When a local district sets competency standards, it is defining
the components of an adequate education, and is taking on the
responsibility for providing such an education. That competency tests
assess skills and objectives that are actually taught in school is
essential to the logic and legality of any such program. If students
are not provided with the opportunity to learn the test content, and
if test content does not match what students are taught in classrooms,
then the system is senseless and unfair, a view affirmed in recent
U.S. court rulings in the Florida minimum competency litigation.
8 Curricular validity focuses attention on this very important 
' requirement of minimum competency testing programs: does the^test mea- 
sure skills and objectives that are fully covered in the district cur- 
riculum? Does classroom instruction afford students relevant practice 
in the assessed skills? While these questions appear simple and 



straightforward, a. methodology for providing answers Is^only now 

emerging, and a number of Issues are yet to. be resolved. For example, 

how do you document p^ss room Instruction to demonstrate tlKstudents 

are actually exposed to the* minimum competency objectives'^ How 

similar must Instructional activities and test content be to count as 

a reasonable match? How much Instruction and practice 1n the assessed 

skills 1s sufficient to fulfill district responsibilities? ■ ♦ . ' 

Formal attempts to deal directly with the problem of match have
developed along two different lines: detailed curriculum analyses and
teacher-based estimations. Approaches to curriculum analyses have
generally involved comparisons between curriculum scope and sequence
charts and test descriptions of content covered. Typically these
analyses have not included information on how much of the scope and
sequence was actually covered. They have also assumed that similar
content or topic labels mean the same thing, e.g., that inferential
comprehension means the same thing to both the test developer and the
curriculum developer.

More recent work has started with a detailed taxonomy of objectives
in a subject domain. Curricular and test coverage are then mapped on
this taxonomy and the extent of overlap is ascertained. Following
this approach one would start with domain specifications, as described
earlier in this paper, and then examine test items and curriculum
materials to verify that the specified skills were indeed assessed by
the test and included in the curriculum. Such an approach yields more
precise estimates of curriculum coverage but is limited in that it
considers only the formal curriculum, not teacher presentations,
teacher-generated instructional activities, or differing rates of
actual curriculum use. Having teachers indicate whether students in
their class have been exposed to the minimum material necessary to
pass each item has been one response to the problem, but the
credibility of such estimation in the context of minimum competency
testing is probably suspect.
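
The taxonomy-based mapping described above amounts to comparing two
sets of objectives against a common list. The Python sketch below is
one hypothetical way to tabulate the overlap; the objective labels and
the `overlap_rate` summary are assumptions made for illustration and
do not come from the report.

```python
def coverage_report(taxonomy, test_objectives, curriculum_objectives):
    """Map test and curriculum coverage onto a shared taxonomy of objectives."""
    taxonomy = set(taxonomy)
    # Ignore labels that are not part of the agreed taxonomy.
    tested = set(test_objectives) & taxonomy
    taught = set(curriculum_objectives) & taxonomy
    overlap = tested & taught
    return {
        "tested_and_taught": sorted(overlap),
        "tested_not_taught": sorted(tested - taught),  # gaps to remedy
        "taught_not_tested": sorted(taught - tested),
        # Share of tested objectives that the formal curriculum covers.
        "overlap_rate": len(overlap) / len(tested) if tested else 0.0,
    }


# Hypothetical reading objectives.
taxonomy = ["main idea", "sequence", "inference",
            "vocabulary in context", "cause and effect"]
on_test = ["main idea", "inference", "vocabulary in context"]
in_curriculum = ["main idea", "sequence", "vocabulary in context"]

report = coverage_report(taxonomy, on_test, in_curriculum)
```

Like the analyses it mirrors, such a tabulation reflects only the
formal curriculum; it says nothing about what teachers actually
presented or how much of the material was used.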

Providing supplementary instruction and appropriate practice
materials for each objective covered on the competency tests also
ensures instructional opportunity for all, and is clearly a necessity
for remediation. Ideally these practice materials would be developed
from the same specifications that guided test development, and/or
could be selected from relevant portions of available curriculum
materials.

Clear articulation of competence across grade levels and a logical
progression of skill development further support students'
opportunity to learn the assessed competencies. For example, do the
reading competencies at grade five and grade seven include the
necessary prerequisites for the required high school reading
proficiencies? The judgment of subject area experts might provide
evidence of a reasonable sequence of skills, and thus reasonable
notice and opportunity to learn.

A district ought to consider an analysis of the formal curriculum
(e.g., basic texts) to determine whether and where each assessed
competency is covered. Supplementary exercises could be developed or
located in other available materials to compensate for any gaps and
to support remedial needs.










Sample Domain Specifications

Grade Level: Grade 3

Subject: Reading Comprehension

Domain Description: Students will select from among written
alternatives the stated main idea of a given short paragraph.

Content Limits:

1. For each item, the student will be presented with a 4-5 sentence
   expository paragraph. Each paragraph will have a stated main idea
   and 3-4 supporting statements.

2. The main idea will be stated in either the first or the last
   sentence of the paragraph. The main idea will associate the
   subject of the paragraph (person, object, action) with a general
   statement of action, or general descriptive statement. E.g.,
   "Smoking is dangerous to your health," "Kenny worked hard to
   become a doctor," "There are many kinds of seals."

3. Supporting statements will give details, examples, or evidence
   supporting the main idea.

4. Paragraphs will be written at no higher than a third grade
   reading level.

Distractor Domain:

1. Students will select an answer from among four written
   alternatives. Each alternative will be a complete sentence.

2. The correct answer will consist of a paraphrase of the stated
   main idea. Paraphrased sentences may be accomplished by employing
   synonyms and/or by changing the word order.

3. Distractors will be constructed from the following:

   a. One distractor will be a paraphrase of one supporting
      statement given in the paragraph (e.g., alternative "a" in the
      sample item).
   b. One to two distractors will be generalizations that can be
      drawn from two of the supporting statements, but do not include
      the entire main idea (e.g., alternative "d" in the sample item).
   c. One distractor may be a statement about the subject of the
      paragraph that is more general than the main idea (e.g.,
      alternative "b" in the sample item).

Format: Each question will be multiple choice with four possible
responses.

Directions: Read each paragraph. Circle the letter that tells the
main idea.

Sample Item:

   Indians had many kinds of homes. Plains Indians lived in tepees
   which were made from skins. The Hopi Indians used bushes to make
   round houses, called hogans. The Mohawks made longhouses out of
   wood. Some Northeast Indians built smaller wooden cabins.

   What is the main idea of this story?

   a. Some Indians used skins to make houses.
   b. There were different Indian tribes.
  *c. Indians built different types of houses.
   d. Indian houses were made of wood.






Grade Level: Grade 8

Subject: Introduction to Algebra

Domain Description: Using basic operations and laws governing open
sentences, solve equations with one unknown quantity.

Content Limits:

1. Stimuli include a number sentence with one unknown quantity,
   represented by a lower case letter in italics, and an array of
   four solution sets or single answers, only one of which is
   correct.

2. Number sentences may be statements of equalities or inequalities.

3. The number sentences may require simplifying before solving by
   combining like terms or carrying out operations indicated (e.g.,
   by parentheses).

4. Number sentences will have no more than five terms. Fractions may
   be used but not decimal fractions and non-decimal fractions in the
   same expression. Exponents (powers) may appear in the expression
   only if they cancel out and need not be solved or modified.

5. Solution sets for equations and inequalities will be drawn from
   the set of rational numbers (+). The null set (∅) may be used as
   a correct solution set.

6. Factoring may be a requisite operation for solving the equation.

7. Application of the distributive property of multiplication and
   the use of reciprocal values may be requisite operations for
   solving the equation.

Distractor Domain:

1. Distractors may be drawn from the set of wrong answers resulting
   from errors involving any one of the following operations:

   a. combining terms
   b. transformations that produce equivalent equations (e.g.,
      transferring terms using the principle of reciprocal values)
   c. distributing multiplication, with positive or negative numbers
      (e.g., across parentheses)
   d. carrying out basic operations using brackets or parentheses

2. Distractors may also be drawn from the set of wrong answers due
   to incomplete solution sets.

3. Distractors may not reflect errors due to wild guessing,
   calculations involving negative numbers, or errors in basic
   operations.

4. "None of the above" is not an acceptable alternative.

Format: Multiple choice; five alternatives

Directions: Solve the equation. Then select the correct answer or
solution set from the choices given.

Sample Item (see directions):

1. 8n + 2 = 2n + 38; n = ?

   a) n = 3
  *b) n = 6
   c) n = 4
   d) n = 5
   e) n = 7.6

2. 16x ≤ 32; x = ?

   a) x = 48
  *b) x = [0,1,2]
   c) x = 2
   d) x = ∅
   e) x = [3,4,5,...]
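
The requirement that only one alternative be correct can be checked
mechanically. The Python sketch below tests each alternative of the
first algebra sample item against the equation 8n + 2 = 2n + 38. The
helper names are hypothetical, and the value 4 for alternative "c" is
read from a damaged original, so treat it as an assumption.

```python
from fractions import Fraction


def satisfies(n):
    """True when n solves the first sample item, 8n + 2 = 2n + 38."""
    return 8 * n + 2 == 2 * n + 38


def keyed_answers(alternatives, predicate):
    """Return the labels of all alternatives that satisfy the item."""
    return [label for label, value in alternatives.items() if predicate(value)]


# Alternatives as exact rationals, so 7.6 is compared without rounding.
alternatives = {
    "a": Fraction(3),
    "b": Fraction(6),
    "c": Fraction(4),      # assumed reading of the garbled alternative
    "d": Fraction(5),
    "e": Fraction("7.6"),
}

correct = keyed_answers(alternatives, satisfies)
```

An item passes this check when `correct` holds exactly one label and
that label matches the starred key.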






Grade Level: Grade 9

Subject: English Punctuation

Domain Description: Correctly punctuating given paragraphs adapted
from a standard eighth grade text of a practical/informative nature.

Content Limits: The student will be presented with one paragraph in
which all the correct punctuation marks have been omitted, except for
apostrophes in contractions (I'll) and possessives, dashes, and
semi-colons.

For each question, students will be asked to choose all the correct
punctuation marks which must be added in a given sentence to make it
correct. Punctuation marks to be identified and added may include:

a. periods at the end of a declarative or imperative sentence, after
   an abbreviation, or an initial

b. question marks following an interrogative sentence

c. exclamation point after exclamatory sentences or interjections

d. colon after the salutation in a business letter, or to separate
   minutes and hours in expressions of time, and to show that a
   series of things or events follows

e. quotation marks enclosing a quotation or a fragment of it,
   enclosing the title of a story or poem which is part of a larger
   book

f. comma in a date or address; to set off such words as "yes" at the
   beginning of a sentence; to set off names of persons or words
   (phrases) in apposition; to separate words in a series, direct
   quotations, parallel adjectives, parenthetical phrases; after the
   salutation and closing in a friendly letter; to separate a
   dependent clause and independent clause in a complex sentence

Distractor Domain: The alternate responses to the questions may
include:

a. omission of punctuation mark(s) within a given sentence which
   should be included, or

b. inclusion of a punctuation mark or marks not necessary or correct
   in the given sentence

Directions: The directions will be given: "Choose the letter which
contains all the necessary punctuation marks in the given sentence
which will make the sentence correct."

Format: Each question will be multiple choice, with four possible
responses.

Sample Item:

   1. If she starts to sing I'll crack up 2. It is funny how it
   hurts to hold back a laugh 3. I was sitting in the auditorium at
   10:00 am and we were having a singing rehearsal for graduation
   4. Sit up Get off those shoulders Think tall Sing tall Sing like
   this said Ms Small 5. I knew that if she was going to tweet like
   a bird again I would laugh 6. But I could not laugh because Ms
   Small would kick me out of the auditorium and that meant Felson's
   office--and no graduation 7. La la la--sing children Sing with
   your hearts said Ms Small 8. I couldn't hold it 9. She was so
   funny I almost rolled off the auditorium seat 10. The other
   students didn't laugh but me I sounded like Santa Claus 11. It
   became quiet for a second 12. What are you doing Joe I know it is
   you Present yourself to Mr Felson at once that voice said 13. Ms
   Small is a foot shorter than a tall Coke but she has the bark of a
   hungry hound dog

   The first sentence should be written:

   a. If she starts to sing again I'll crack up.
   b. If she, starts to sing again, I'll crack up
  *c. If she starts to sing again, I'll crack up.
   d. If she starts, to sing again, I'll crack up.



